NSF PAR Search | NSF Public Access Repository

PepBERT: Lightweight language models for bioactive peptide representation

https://doi.org/10.1101/2025.04.08.647838

Du, Zhenjiao; Caragea, Doina; Guo, Xiaolong; Li, Yonghui (April 2025, bioRxiv)

Protein language models (pLMs) have been widely adopted for various protein and peptide-related downstream tasks and demonstrated promising performance. However, short peptides are significantly underrepresented in commonly used pLM training datasets. For example, only 2.8% of sequences in the UniProt Reference Cluster (UniRef) contain fewer than 50 residues, which potentially limits the effectiveness of pLMs for peptide-specific applications. Here, we present PepBERT, a lightweight and efficient peptide language model specifically designed for encoding peptide sequences. Two versions of the model, PepBERT-large (4.9 million parameters) and PepBERT-small (1.86 million parameters), were pretrained from scratch using four custom peptide datasets and evaluated on nine peptide-related downstream prediction tasks. Both PepBERT models achieved performance superior to or comparable to the benchmark model, ESM-2 with 7.5 million parameters, on 8 out of 9 datasets. Overall, PepBERT provides a compact yet effective solution for generating high-quality peptide representations for downstream applications. By enabling more accurate representation and prediction of bioactive peptides, PepBERT can accelerate the discovery of food-derived bioactive peptides with health-promoting properties, supporting the development of sustainable functional foods and value-added utilization of food processing by-products. The datasets, source code, pretrained models, and tutorials for the usage of PepBERT are available at https://github.com/dzjxzyd/PepBERT.
Free, publicly-accessible full text available April 14, 2026
FusionESP: Improved Enzyme–Substrate Pair Prediction by Fusing Protein and Chemical Knowledge

https://doi.org/10.1021/acs.jcim.4c02357

Du, Zhenjiao; Fu, Weimin; Guo, Xiaolong; Caragea, Doina; Li, Yonghui (March 2025, Journal of Chemical Information and Modeling)

Free, publicly-accessible full text available March 24, 2026
Disaster Image Classification Using Pre-trained Transformer and Contrastive Learning Models

https://doi.org/10.1109/DSAA60987.2023.10302517

Dinani, Soudabeh; Caragea, Doina (October 2023, The 10th IEEE International Conference on Data Science and Advanced Analytics (DSAA))

Natural disasters can have devastating consequences for communities, causing loss of life and significant economic damage. To mitigate these impacts, it is crucial to quickly and accurately identify situational awareness and actionable information useful for disaster relief and response organizations. In this paper, we study the use of advanced transformer and contrastive learning models for disaster image classification in a humanitarian context, with a focus on state-of-the-art pre-trained vision transformers such as ViT and CSWin, and a state-of-the-art pre-trained contrastive learning model, CLIP. We evaluate the performance of these models across various disaster scenarios, including in-domain and cross-domain settings, as well as few-shot learning and zero-shot learning settings. Our results show that the CLIP model outperforms the two transformer models (ViT and CSWin) and also ConvNeXts, a competitive CNN-based model resembling transformers, in all the settings. By improving the performance of disaster image classification, our work can contribute to the goal of reducing the number of deaths and economic losses caused by disasters, as well as helping to decrease the number of people affected by these events.
Full Text Available
Using recurrent neural networks to detect supernumerary chromosomes in fungal strains causing blast diseases

https://doi.org/10.1093/nargab/lqae108

Gyawali, Nikesh; Hao, Yangfan; Lin, Guifang; Huang, Jun; Bika, Ravi; Daza, Lidia Calderon; Zheng, Huakun; Cruppe, Giovana; Caragea, Doina; Cook, David; et al. (July 2024, NAR Genomics and Bioinformatics)

The genomes of the fungus Magnaporthe oryzae that causes blast diseases on diverse grass species, including major crops, have indispensable core-chromosomes and may contain supernumerary chromosomes, also known as mini-chromosomes. These mini-chromosomes are speculated to provide effector gene mobility, and may transfer between strains. To understand the biology of mini-chromosomes, it is valuable to be able to detect whether a M. oryzae strain possesses a mini-chromosome. Here, we applied recurrent neural network models for classifying DNA sequences as arising from core- or mini-chromosomes. The models were trained with sequences from available core- and mini-chromosome assemblies, and then used to predict the presence of mini-chromosomes in a global collection of M. oryzae isolates using short-read DNA sequences. The model predicted that mini-chromosomes were prevalent in M. oryzae isolates. Interestingly, at least one mini-chromosome was present in all recent wheat isolates, but no mini-chromosomes were found in early isolates collected before 1991, indicating a preferential selection for strains carrying mini-chromosomes in recent years. The model was also used to identify assembled contigs derived from mini-chromosomes. In summary, our study has developed a reliable method for categorizing DNA sequences and showcases an application of recurrent neural networks in predictive genomics.
Full Text Available
Semi-Supervised Few-Shot Learning for Fine-Grained Disaster Tweet Classification

https://doi.org/10.59297/FWXE4933

Zou, Henry Peng; Caragea, Cornelia; Zhou, Yue; Caragea, Doina (May 2023, Proceedings of the 20th International ISCRAM Conference)
Radianti, Jaziar; Dokas, Ioannis; Lalone, Nicolas; Khazanchi, Deepak (Ed.)
The real-time information about natural disasters shared on social media platforms like Twitter and Facebook plays a critical role in informing volunteers, emergency managers, and response organizations. However, supervised learning models for monitoring disaster events require large amounts of annotated data, making them unrealistic for real-time use in disaster events. To address this challenge, we present a fine-grained disaster tweet classification model under the semi-supervised, few-shot learning setting where only a small amount of annotated data is required. Our model, CrisisMatch, effectively classifies tweets into fine-grained classes of interest using a small amount of labeled data and large amounts of unlabeled data, mimicking the early stage of a disaster. By integrating effective semi-supervised learning ideas and incorporating TextMixUp, CrisisMatch achieves an average performance improvement of 11.2% on two disaster datasets. Further analyses are also provided for the influence of the amount of labeled data and for out-of-domain results.
Full Text Available
A Comparison Study for Disaster Tweet Classification Using Deep Learning Models

https://doi.org/10.5220/0012129300003541

Dinani, Soudabeh; Caragea, Doina (January 2023, Proceedings of the 12th International Conference on Data Science, Technology and Applications DATA)

Effectively filtering and categorizing the large volume of user-generated content on social media during disaster events can help emergency management and disaster response prioritize their resources. Deep learning approaches, including recurrent neural networks and transformer-based models, have been previously used for this purpose. Capsule Neural Networks (CapsNets), initially proposed for image classification, have been proven to be useful for text analysis as well. However, to the best of our knowledge, CapsNets have not been used for classifying crisis-related messages, and have not been extensively compared with state-of-the-art transformer-based models, such as BERT. Therefore, in this study, we performed a thorough comparison between CapsNet models, state-of-the-art BERT models and two popular recurrent neural network models that have been successfully used for tweet classification, specifically, LSTM and Bi-LSTM models, on the task of classifying crisis tweets both in terms of their informativeness (binary classification) and their humanitarian content (multi-class classification). For this purpose, we used several benchmark datasets for crisis tweet classification, namely CrisisBench, CrisisNLP and CrisisLex. Experimental results show that the performance of the CapsNet models is on a par with that of LSTM and Bi-LSTM models for all metrics considered, while the performance obtained with BERT models surpasses that of the other three models across different datasets and classes for both classification tasks. BERT could thus be considered the best overall model for classifying crisis tweets.
Full Text Available
LMNglyPred: prediction of human N-linked glycosylation sites using embeddings from a pre-trained protein language model

https://doi.org/10.1093/glycob/cwad033

Pakhrin, Subash C.; Pokharel, Suresh; Aoki-Kinoshita, Kiyoko F.; Beck, Moriah R.; Dam, Tarun K.; Caragea, Doina; KC, Dukka B. (April 2023, Glycobiology)

Protein N-linked glycosylation is an important post-translational mechanism in Homo sapiens, playing essential roles in many vital biological processes. It occurs at the N-X-[S/T] sequon in amino acid sequences, where X can be any amino acid except proline. However, not all N-X-[S/T] sequons are glycosylated; thus, the N-X-[S/T] sequon is a necessary but not sufficient determinant for protein glycosylation. In this regard, computational prediction of N-linked glycosylation sites confined to N-X-[S/T] sequons is an important problem that has not been extensively addressed by the existing methods, especially in regard to the creation of negative sets and leveraging the distilled information from protein language models (pLMs). Here, we developed LMNglyPred, a deep learning-based approach, to predict N-linked glycosylated sites in human proteins using embeddings from a pre-trained pLM. LMNglyPred produces sensitivity, specificity, Matthews Correlation Coefficient, precision, and accuracy of 76.50%, 75.36%, 0.49, 60.99%, and 75.74%, respectively, on a benchmark-independent test set. These results demonstrate that LMNglyPred is a robust computational tool to predict N-linked glycosylation sites confined to the N-X-[S/T] sequon.
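The sequon rule stated in this abstract (asparagine, then any residue except proline, then serine or threonine) can be illustrated with a short scan. This is a minimal sketch under stated assumptions, not LMNglyPred's code: the function name `find_sequons` and the example sequence are hypothetical, and the scan only locates candidate sites, which the abstract notes is necessary but not sufficient for glycosylation.

```python
import re
from typing import List

# N-X-[S/T] sequon: N, then any character other than P, then S or T.
# The zero-width lookahead lets overlapping sequons each be reported.
# Note: [^P] accepts any non-P character, so input is assumed to be
# a plain one-letter amino acid string.
SEQUON = re.compile(r"(?=N[^P][ST])")

def find_sequons(sequence: str) -> List[int]:
    """Return 0-based positions of the asparagine in each N-X-[S/T] sequon."""
    return [m.start() for m in SEQUON.finditer(sequence.upper())]

# Hypothetical sequence for illustration only.
example = "MNGTANPSQNNSS"
positions = find_sequons(example)
print(positions)  # candidate sites; "NPS" at index 5 is excluded
```

A predictor of the kind described would then score each candidate position; the scan itself makes no claim about which sites are actually glycosylated.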
DeepNGlyPred: A Deep neural network-based approach for human N-linked glycosylation site prediction

Pakhrin, Subash; Aoki-Kinoshita, Kiyoko; Caragea, Doina; Dukka, KC (December 2021, Molecules)

Protein N-linked glycosylation is a post-translational modification that plays an important role in a myriad of biological processes. Computational prediction approaches serve as complementary methods for the characterization of glycosylation sites. Most of the existing predictors for N-linked glycosylation utilize the information that the glycosylation site occurs at the N-X-[S/T] sequon, where X is any amino acid except proline. Not all N-X-[S/T] sequons are glycosylated; thus the N-X-[S/T] sequon is a necessary but not sufficient determinant for protein glycosylation. In that regard, computational prediction of N-linked glycosylation sites confined to N-X-[S/T] sequons is an important problem. Here, we report DeepNGlyPred, a deep learning-based approach that encodes the positive and negative sequences in the human proteome dataset (extracted from N-GlycositeAtlas) using sequence-based features (gapped-dipeptide), predicted structural features, and evolutionary information. DeepNGlyPred produces SN, SP, MCC, and ACC of 88.62%, 73.92%, 0.60, and 79.41%, respectively, on the N-GlyDE independent test set, which is better than the compared approaches. These results demonstrate that DeepNGlyPred is a robust computational technique to predict N-linked glycosylation sites confined to the N-X-[S/T] sequon. DeepNGlyPred will be a useful resource for the glycobiology community.
Full Text Available
Disaster Image Classification Using Capsule Networks

https://doi.org/10.1109/IJCNN52387.2021.9534448

Dinani, Soudabeh Taghian; Caragea, Doina (July 2021, 2021 International Joint Conference on Neural Networks (IJCNN))

Full Text Available
DeepNGlyPred: A Deep Neural Network-Based Approach for Human N-Linked Glycosylation Site Prediction

https://doi.org/10.3390/molecules26237314

Pakhrin, Subash C.; Aoki-Kinoshita, Kiyoko F.; Caragea, Doina; KC, Dukka B. (December 2021, Molecules)

Full Text Available
